238 PART 5 Looking for Relationships with Correlation and Regression

one level for the reference group (let’s choose 3), and then create two binary indicator variables for the other two levels: one for 1 = graduated high school and one for 2 = graduated college. Here’s another example of coding a multilevel categorical variable as a set of indicator variables, where each level is assigned its own binary variable that is coded 1 if the level applies to the row and 0 if it does not (see Table 17-1).

Table 17-1 shows theoretical coding for a data set containing the variables StudyID (for participant ID) and PrimaryDx (for participant primary diagnosis). As shown in Table 17-1, you take each level and make an indicator variable for it: hypertension is HTN, diabetes is Diab, cancer is Cancer, and other is OtherDx. Instead of including the variable PrimaryDx in the model, you’d include the indicator variables for all levels of PrimaryDx except the reference level. So, if the reference level you selected for PrimaryDx was hypertension, you’d include Diab, Cancer, and

OtherDx in the regression, but would not include HTN. As in the education example, each participant in Table 17-1 has a 1 for exactly one indicator variable, because PrimaryDx holds a single primary diagnosis. After you leave HTN out of the model, the participants in the reference group are the ones coded 0 on all of the included indicator variables.
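The coding described above can be sketched in a few lines of plain Python. This is only an illustration built on the five rows of Table 17-1, not the way any particular statistics package does it; the function names `code_indicators` and `predictors_for_model` are hypothetical, and hypertension is assumed to be the reference level, as in the text.

```python
# Rows from Table 17-1: (StudyID, PrimaryDx)
rows = [
    (1, "Hypertension"),
    (2, "Diabetes"),
    (3, "Cancer"),
    (4, "Other"),
    (5, "Diabetes"),
]

# Map each PrimaryDx level to the name of its indicator variable.
indicator_names = {
    "Hypertension": "HTN",
    "Diabetes": "Diab",
    "Cancer": "Cancer",
    "Other": "OtherDx",
}


def code_indicators(rows, indicator_names):
    """Return one dict per row with a 0/1 indicator for every level."""
    coded = []
    for study_id, dx in rows:
        record = {"StudyID": study_id, "PrimaryDx": dx}
        for level, name in indicator_names.items():
            # 1 if this level applies to the row, 0 if it does not
            record[name] = 1 if dx == level else 0
        coded.append(record)
    return coded


def predictors_for_model(coded, indicator_names, reference_level):
    """Keep every indicator except the one for the reference level."""
    keep = [
        name
        for level, name in indicator_names.items()
        if level != reference_level
    ]
    return [{name: record[name] for name in keep} for record in coded]


coded = code_indicators(rows, indicator_names)
predictors = predictors_for_model(
    coded, indicator_names, reference_level="Hypertension"
)

for record in predictors:
    print(record)
```

In this sketch, participant 1 (hypertension) ends up with a 0 for Diab, Cancer, and OtherDx, which is how the reference group appears in the model.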

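Don’t forget to leave the reference-level indicator variable out of the regression, or your model will break! If you include an indicator for every level, the indicators add up to 1 in every row, which duplicates the model’s intercept, so the software can’t estimate a unique coefficient for each one.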
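A small check on the Table 17-1 values shows the redundancy behind this warning. It is a sketch in plain Python, assuming the model has an intercept (a column of 1s); the variable names are illustrative only.

```python
# Indicator values copied from Table 17-1.
# Columns: HTN, Diab, Cancer, OtherDx
indicators = [
    [1, 0, 0, 0],  # StudyID 1, Hypertension
    [0, 1, 0, 0],  # StudyID 2, Diabetes
    [0, 0, 1, 0],  # StudyID 3, Cancer
    [0, 0, 0, 1],  # StudyID 4, Other
    [0, 1, 0, 0],  # StudyID 5, Diabetes
]

# The intercept is a column of 1s, one per participant.
intercept = [1] * len(indicators)

# With all four indicators, every row sums to 1, so the indicators
# together reproduce the intercept column exactly.
row_sums = [sum(row) for row in indicators]
print(row_sums == intercept)  # True

# Dropping the reference-level indicator (HTN, the first column)
# removes that exact duplication.
reduced = [row[1:] for row in indicators]
reduced_sums = [sum(row) for row in reduced]
print(reduced_sums == intercept)  # False
```

The second comparison is False because the hypertension row now sums to 0, so the remaining indicators no longer add up to the intercept.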

Creating scatter charts before you jump into multiple regression analysis

One common mistake researchers make is immediately running a regression or

another advanced statistical analysis before thoroughly examining their data. As

TABLE 17-1  Coding a Multilevel Category into a Set of Binary Indicator Variables

StudyID  PrimaryDx     HTN  Diab  Cancer  OtherDx
1        Hypertension  1    0     0       0
2        Diabetes      0    1     0       0
3        Cancer        0    0     1       0
4        Other         0    0     0       1
5        Diabetes      0    1     0       0